S3Aug: Segmentation, Sampling, and Shift for Action Recognition
Action recognition is a well-established area of research in computer vision.
In this paper, we propose S3Aug, a video data augmentation method for action
recognition. Unlike conventional video data augmentation methods that cut and
paste regions from two videos, the proposed method generates new videos from a
single training video through segmentation and label-to-image transformation.
Furthermore, the proposed method modifies certain categories of the label
images by sampling to generate a variety of videos, and shifts intermediate
features to enhance the temporal coherency between frames of the generated
videos. Experimental results on the UCF101, HMDB51, and Mimetics datasets
demonstrate the effectiveness of the proposed method, particularly for the
out-of-context videos of the Mimetics dataset.
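The per-frame part of the pipeline described above (segment each frame, resample some label categories, then regenerate the frame by label-to-image transformation) can be sketched roughly as follows. This is a hedged illustration under assumptions, not the authors' code: `segmenter`, `generator`, and `swap_prob` are hypothetical placeholders, and the feature-shift step for temporal coherency is only noted in a comment.

```python
import torch

def s3aug_frame(frame, segmenter, generator, swap_prob=0.5):
    """Augment one frame: segment, resample some categories, regenerate."""
    labels = segmenter(frame)               # (H, W) integer category map
    src = labels.clone()
    num_classes = int(src.max()) + 1
    for c in src.unique():                  # "sampling": remap some categories
        if torch.rand(()) < swap_prob:
            labels[src == c] = int(torch.randint(0, num_classes, (1,)))
    return generator(labels)                # label-to-image transformation

def s3aug(frames, segmenter, generator):
    # In the paper, temporal coherency comes from shifting intermediate
    # features inside the generator; that step is omitted from this sketch.
    return torch.stack([s3aug_frame(f, segmenter, generator) for f in frames])
```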
Joint learning of images and videos with a single Vision Transformer
In this study, we propose a method for jointly learning images and videos
using a single model. In general, images and videos are trained with separate
models. In this paper, we propose a method that takes a batch of images as
input to a single Vision Transformer (IV-ViT), along with a set of video
frames that are temporally aggregated by late fusion. Experimental results on
two image datasets and two action recognition datasets are presented.
Comment: MVA2023 (18th International Conference on Machine Vision
Applications), Hamamatsu, Japan, 23-25 July 2023
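A minimal sketch of the late-fusion idea described in this abstract, assuming a generic per-image backbone; the class name, method, and pooling choice (averaging logits over frames) are my assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class JointImageVideoModel(nn.Module):
    """One backbone for both images and videos; video frames are encoded
    independently and temporally aggregated by late fusion."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone            # any per-image model, e.g. a ViT

    def forward(self, x):
        if x.dim() == 4:                    # images: (B, 3, H, W)
            return self.backbone(x)
        b, t = x.shape[:2]                  # videos: (B, T, 3, H, W)
        logits = self.backbone(x.flatten(0, 1))    # encode all frames at once
        return logits.view(b, t, -1).mean(dim=1)   # late fusion: average over T
```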
Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition
We propose Multi-head Self/Cross-Attention (MSCA), which introduces a
temporal cross-attention mechanism for action recognition, based on the
structure of the Multi-head Self-Attention (MSA) mechanism of the Vision
Transformer (ViT). Simply applying ViT to each frame of a video can capture
frame features, but cannot model temporal information. However, simply
modeling temporal information with a CNN or Transformer is computationally
expensive. TSM, which performs feature shifting, assumes a CNN backbone and
cannot take advantage of the ViT structure. The proposed model captures
temporal information by shifting the Query, Key, and Value in the calculation
of the MSA of ViT. This is efficient, incurs no additional computational cost,
and is a structure well suited to extending ViT along the temporal dimension.
Experiments on Kinetics400 show the effectiveness of the proposed method and
its superiority over previous methods.
Comment: 9 pages
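The core mechanism here, rolling attention inputs across the time axis so that attention becomes cross-attention between neighboring frames, can be sketched as below. This is an assumption-laden illustration, not the authors' implementation: the paper shifts Query, Key, and Value, while this sketch shifts only Key and Value with a TSM-style channel fraction (`shift_div`), which may differ from the exact scheme in the paper.

```python
import torch
import torch.nn.functional as F

def temporal_shift(x, shift_div=8):
    """x: (B, T, N, C) patch tokens per frame. Roll a fraction of the
    channels one step forward/backward in time, zero-padding boundaries."""
    out = x.clone()
    fold = x.size(-1) // shift_div
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                  # t-1 -> t
    out[:, 0, :, :fold] = 0
    out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]  # t+1 -> t
    out[:, -1, :, fold:2 * fold] = 0
    return out

def shifted_attention(q, k, v, num_heads=8):
    """Attention where K and V are temporally shifted, so frame t's queries
    partly attend to neighboring-frame keys/values (cross-attention)."""
    b, t, n, c = q.shape
    k, v = temporal_shift(k), temporal_shift(v)
    def heads(x):                           # -> (B*T, heads, N, C//heads)
        return x.reshape(b * t, n, num_heads, c // num_heads).transpose(1, 2)
    out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
    return out.transpose(1, 2).reshape(b, t, n, c)
```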
A Study on Methods for Extracting Object and Person Regions in Images
Nagoya University (名古屋大学), Doctor of Engineering (doctoral thesis)
New Teaching Materials and Instruction Methods for Programming Practice
FY2004 (Heisei 16) Conference on Engineering and Engineering Education
Research, slides; venue: Kanazawa Institute of Technology, Ishikawa
Prefecture; date: July 2004
- …